Why do we need to visualise data?
Why R?
Data exploration
Basic plotting using base R tools
Using ggplot2 to create more customisable plots
Creating interactive plots using plotly and highcharter
Max Roser (2018) - “Twice as long — life expectancy around the world” Published online at OurWorldinData.org. Retrieved from: ‘https://ourworldindata.org/life-expectancy-globally’
It is the language I know best!
R is a powerful language for statistical computing and graphics. There are loads of open source packages that allow different levels of customisation of charts, depending on your requirements.
Today I’ll talk about some very popular packages:
ggplot2, plotly (including its ggplotly() function) and
highcharter, which allow for very detailed customisation of
plots and interactive features. A couple of years ago Graham and Adnan
did C&Cs all about plotting data using ggplot2 and
highcharter. I will cover a little of the same ground,
but for more dedicated tutorials I would have a look at those (links at
the end).
Who is this for? Anyone that has an interest in plotting data in R. I will start from extremely basic plotting and move on to more ‘fancy’ plots. So, if you are completely new to R and want to start using R for your visualisations I hope this C&C gives you a good place to start. However, if you are well versed in R but haven’t used interactive plots before I hope this will be useful for you too. Regardless, if you feel like you might have anything to add please feel free to chime in in the chat and we can share our knowledge.
You don’t have to use R; there are other tools and coding
languages that can do the same thing, and some packages even share
nearly the same syntax across languages (ggplot2 in R /
plotnine in Python).
To keep things simple, I’ve tried to keep the packages to a minimum
and we will be using one dataset all the way through, which is from the
package gapminder. It contains global life expectancy data
from 1952 to 2007 at 5-year intervals. We will use base R functions and 3
additional packages for the visualisations: ggplot2,
plotly (which provides the ggplotly() function) and highcharter.
#install.packages(c("NHSRtheme", "nhsbsaR", "skimr", "gapminder", "ggplot2", "plotly", "highcharter"))
# load packages:
# Helper packages
library(NHSRtheme) # for NHS colour palette
library(nhsbsaR) # for NHSBSA ggplot and highcharter themes
library(skimr) # for data summary statistics
# Plotting packages
library(ggplot2)
library(plotly)
library(highcharter)
# Dataset we will be using
library(gapminder)
We can now load the data from gapminder:
# Let's load the data from the gapminder package:
data(gapminder)
It’s always a good idea to have a look at the data before you go ahead and plot it, just to see if there’s anything that you need to consider. It might help you decide which type of plot to use.
We can have a quick look at data types, missingness and whether we suspect there may be any outliers or uneven distribution.
You could use the summary() function for this, however a
function I prefer to use is skim() from the
skimr package. Not only does it return standard summary
statistics such as mean, standard deviation and quartiles, it gives you
the proportion of missing data for each variable and even a (small)
histogram so that you can have a quick look at the distribution too.
skimr::skim(gapminder)
| Name | gapminder |
| Number of rows | 1704 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| country | 0 | 1 | FALSE | 142 | Afg: 12, Alb: 12, Alg: 12, Ang: 12 |
| continent | 0 | 1 | FALSE | 5 | Afr: 624, Asi: 396, Eur: 360, Ame: 300 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 1979.50 | 17.27 | 1952.00 | 1965.75 | 1979.50 | 1993.25 | 2007.0 | ▇▅▅▅▇ |
| lifeExp | 0 | 1 | 59.47 | 12.92 | 23.60 | 48.20 | 60.71 | 70.85 | 82.6 | ▁▆▇▇▇ |
| pop | 0 | 1 | 29601212.32 | 106157896.74 | 60011.00 | 2793664.00 | 7023595.50 | 19585221.75 | 1318683096.0 | ▇▁▁▁▁ |
| gdpPercap | 0 | 1 | 7215.33 | 9857.45 | 241.17 | 1202.06 | 3531.85 | 9325.46 | 113523.1 | ▇▁▁▁▁ |
By summarising we can see that gapminder is a dataset of
life expectancy values, GDP per capita and population size (all numeric). The data
can be grouped by country, continent and year.
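For comparison, roughly the same checks can be done with base R alone. This is a minimal sketch that assumes only that the gapminder package is installed; the object names column_types and missing_counts are illustrative.

```r
library(gapminder)

# Data type of each column (first class only, to keep the output short)
column_types <- sapply(gapminder, function(col) class(col)[1])
print(column_types)

# Number of missing values in each column
missing_counts <- colSums(is.na(gapminder))
print(missing_counts)

# Standard summary statistics for each column
summary(gapminder)
```

The output is less compact than skim(), but it needs no extra packages.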
To explore the dataset we can have a look at some basic plotting functions.
hist() function:
hist(gapminder$lifeExp)
plot() function:
plot(gapminder$gdpPercap, gapminder$lifeExp)
# it probably needs a logarithmic x axis:
plot(lifeExp ~ gdpPercap, data = gapminder, log = 'x')
# and a change of x axis label:
plot(lifeExp ~ gdpPercap, data = gapminder, log = 'x', xlab = "GDP per capita (Log-scale)")
boxplot() for a boxplot:
For the next few plots we will filter the gapminder
dataset to the most recent year (2007) to simplify things:
gap_2007 <- gapminder |> dplyr::filter(year == 2007) # Filtering to a specific year (dplyr is not attached above, so it is called with dplyr::)
boxplot(
lifeExp ~ continent,
data = gap_2007
)
There are plenty of other base R plotting functions too, including barplot(), pie(), stripchart() and line plots. For a good resource on base R plotting functions see here: https://intro2r.com/custom_plot.html.
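As one example of those other functions, a bar chart of mean life expectancy per continent could be sketched with barplot(). This assumes the gapminder package is installed; mean_life_exp is an illustrative name, and the 2007 filter is repeated here in base R so that the snippet runs on its own.

```r
library(gapminder)

# Keep only the most recent year, using base R subsetting
gap_2007 <- gapminder[gapminder$year == 2007, ]

# Unweighted mean life expectancy across countries, per continent
mean_life_exp <- tapply(gap_2007$lifeExp, gap_2007$continent, mean)

# barplot() uses the names of the vector as the bar labels
barplot(
  mean_life_exp,
  main = "Mean Life Expectancy by Continent (2007)",
  xlab = "Continent",
  ylab = "Mean Life Expectancy"
)
```

Note that this is a simple average over countries, so it does not account for population size.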
We have options to change some parameters to customise and tidy up the basic plots:
boxplot(
lifeExp ~ continent,
data = gap_2007,
pch = 2, # symbols used for plotting points
lty = 1, # line type, solid, dash, dot-dash etc
lwd = 3, # linewidth
col = "lightgreen", # fill colour
border = "darkgreen", # border colour
main = "Life Expectancy by Continent (2007)", # main title
xlab = "Continent", # x-axis title
ylab = "Life Expectancy" # y-axis title
)
A list of base colours that can be referenced by name can be found here: https://r-graph-gallery.com/42-colors-names.html. Alternatively hexadecimal codes can be used for specific colours. We do have a recommended colour palette to use to keep our visualisations in line with NHS identity. Details of the hexadecimal codes for the recommended colours can be found here.
Alternatively, in R we can use the get_nhs_colours()
function from the NHSRtheme package.
NHSRtheme::get_nhs_colours()
## DarkBlue Blue BrightBlue LightBlue AquaBlue Black DarkGrey
## "#003087" "#005EB8" "#0072CE" "#41B6E6" "#00A9CE" "#231f20" "#425563"
## MidGrey PaleGrey DarkGreen Green LightGreen AquaGreen Purple
## "#768692" "#E8EDEE" "#006747" "#009639" "#78BE20" "#00A499" "#330072"
## DarkPink Pink DarkRed Red Orange WarmYellow Yellow
## "#7C2855" "#AE2573" "#8A1538" "#DA291C" "#ED8B00" "#FFB81C" "#FAE100"
We can create a palette that we can reference when plotting:
nhs_palette <- NHSRtheme::get_nhs_colours()
boxplot(
lifeExp ~ continent,
data = gap_2007,
col = nhs_palette["LightBlue"],
border = nhs_palette["DarkBlue"],
main = "Life Expectancy by Continent (2007)",
xlab = "Continent",
ylab = "Life Expectancy"
)
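If NHSRtheme is not available, the hexadecimal codes can be passed directly instead. A small sketch, assuming the gapminder package is installed and using the LightBlue and DarkBlue codes from the palette listing above; fill_colour, border_colour and box_stats are illustrative names.

```r
library(gapminder)

# Repeat the 2007 filter in base R so the snippet is self-contained
gap_2007 <- gapminder[gapminder$year == 2007, ]

# Hex codes copied from the palette listing: LightBlue and DarkBlue
fill_colour <- "#41B6E6"
border_colour <- "#003087"

# boxplot() invisibly returns the statistics it plotted, so we can keep them
box_stats <- boxplot(
  lifeExp ~ continent,
  data = gap_2007,
  col = fill_colour,
  border = border_colour,
  main = "Life Expectancy by Continent (2007)",
  xlab = "Continent",
  ylab = "Life Expectancy"
)
```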
There are many other things that you can change, but we are nearing the limits of how much you would likely customise a base R plot. If you want to read a bit more about base R plotting there are loads of great guides online - this one is good: https://intro2r.com/custom_plot.html
However, there are other R packages that have been specifically designed to be flexible and easy to use, and that make it simpler to build up high-quality plots.
ggplot2 is one of the most popular packages for creating
publication-quality visualisations. It is based around the ‘Grammar of
Graphics’ which is essentially a set of building blocks and rules
necessary to create a plot. It allows us to layer up elements of a plot
until it has everything we need and looks how we want it to. Once you
know the basics it is fairly intuitive to use. You need three main
elements to create a plot:
Data: self explanatory
Aesthetics (aes): The variables you
want to plot. These can be mapped to axes or other attributes such as
colours, groups, sizing etc.
Geometry (‘geom’): The type of plot you would
like (e.g., geom_point,
geom_line, geom_boxplot)
So, at its most basic we can plot:
ggplot(gapminder, aes(
x = gdpPercap,
y = lifeExp)) +
geom_point()
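To show how the three elements fit together, swapping only the geometry and the mapped variables gives a different chart from the same data. A sketch, assuming ggplot2 and gapminder are installed (p_box is an illustrative name):

```r
library(ggplot2)
library(gapminder)

# Repeat the 2007 filter in base R so the snippet is self-contained
gap_2007 <- gapminder[gapminder$year == 2007, ]

p_box <- ggplot(
  data = gap_2007,                           # Data
  mapping = aes(x = continent, y = lifeExp)  # Aesthetics
) +
  geom_boxplot()                             # Geometry

p_box
```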
Other attributes, such as axis transformations, plot titles, legend
adjustments, themes, colours and gridlines are layered over the initial
plot using a + followed by the additional lines of
code.
ggplot(gap_2007, aes(
x = gdpPercap,
y = lifeExp,
color = continent)) +
geom_point() +
scale_x_log10() +
labs(
title = "Life Expectancy vs GDP per Capita",
x = "GDP per Capita (log scale)",
y = "Life Expectancy"
)
There are preset themes that can be added to a plot such as:
theme_minimal(), theme_classic(),
theme_dark(). Helpfully there is an NHSBSA ggplot theme
within the nhsbsaR package.
ggplot(gap_2007, aes(
x = gdpPercap,
y = lifeExp,
color = continent
)) +
geom_point(alpha = 1) +
scale_x_log10() +
labs(
title = "Life Expectancy vs GDP per Capita",
x = "GDP per Capita (log scale)",
y = "Life Expectancy"
) +
nhsbsaR::theme_nhsbsa_gg()
We can use scale_color_manual() to customise to our
NHSBSA custom colour scheme:
ggplot(gap_2007, aes(
x = gdpPercap,
y = lifeExp,
color = continent
)) +
geom_point(alpha = 0.6, size = 2) +
scale_x_log10() +
labs(
title = "Life Expectancy vs GDP per Capita",
x = "GDP per Capita (log scale)",
y = "Life Expectancy"
) +
nhsbsaR::theme_nhsbsa_gg() +
scale_color_manual(values = c(
"Asia" = "#003087", # NHS Dark Blue
"Europe" = "#009639", # NHS Green
"Africa" = "#7C2855", # NHS Dark Pink
"Americas" = "#FFB81C", # NHS Warm Yellow
"Oceania" = "#00A9CE" # NHS Aqua Blue
))
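Because the same colour mapping is reused in several plots, one option is to store it once as a named vector and pass that to scale_color_manual(). This is only a sketch, assuming ggplot2 and gapminder are installed; continent_colours and p_scatter are illustrative names, and the NHSBSA theme is left out so the snippet does not depend on nhsbsaR.

```r
library(ggplot2)
library(gapminder)

# Repeat the 2007 filter in base R so the snippet is self-contained
gap_2007 <- gapminder[gapminder$year == 2007, ]

# Names must match the levels of the continent factor exactly
continent_colours <- c(
  "Asia"     = "#003087",
  "Europe"   = "#009639",
  "Africa"   = "#7C2855",
  "Americas" = "#FFB81C",
  "Oceania"  = "#00A9CE"
)

p_scatter <- ggplot(gap_2007, aes(
  x = gdpPercap,
  y = lifeExp,
  color = continent
)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_x_log10() +
  scale_color_manual(values = continent_colours)

p_scatter
```

Defining the vector once means a colour change only has to be made in one place.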
Another useful feature of ggplot2 is faceting.
If we return to the unfiltered gapminder dataset, which
includes all years, we can very quickly create the above scatterplot
for each year in our dataset by using facet_wrap().
ggplot(gapminder, aes(
x = gdpPercap,
y = lifeExp,
color = continent
)) +
geom_point(alpha = 0.6) +
scale_x_log10() +
facet_wrap(~ year) +
labs(
title = "Life Expectancy vs GDP per Capita Over Time",
x = "GDP per Capita (log scale)",
y = "Life Expectancy"
) +
nhsbsaR::theme_nhsbsa_gg() +
scale_color_manual(values = c(
"Asia" = "#003087", # NHS Dark Blue
"Europe" = "#009639", # NHS Green
"Africa" = "#7C2855", # NHS Dark Pink
"Americas" = "#FFB81C", # NHS Warm Yellow
"Oceania" = "#00A9CE" # NHS Aqua Blue
))
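Faceting works with any variable, not just year. As a further sketch (assuming ggplot2 and gapminder are installed; p_lines is an illustrative name), the panels below are split by continent and show one line per country over time:

```r
library(ggplot2)
library(gapminder)

p_lines <- ggplot(gapminder, aes(
  x = year,
  y = lifeExp,
  group = country   # one line per country
)) +
  geom_line(alpha = 0.4) +
  facet_wrap(~ continent) +   # one panel per continent
  labs(
    title = "Life Expectancy Over Time by Continent",
    x = "Year",
    y = "Life Expectancy"
  )

p_lines
```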
There are so many different things you can do with
ggplot2, it is very flexible. If you care to have a look
there are r-charts gallery pages here to give some
inspiration.
The one downside to ggplot2 is that the plots are
static, which is great for print publications; however, publishing
reports online gives us the opportunity to make our plots interactive.
Features like hover tooltips, zooming, panning and using a mouse drag
to select data can make a visual much more engaging. The plotly package
is an R graphics package designed just for this, and its
ggplotly() function allows really quick conversion of a ggplot visual to
make it interactive.
Starting with one of our previous ggplot2 plots of GDP per capita
vs life expectancy, we can simply add one more line of code to make the
plot interactive, with tooltips, select, zoom and pan functionality.
p <- ggplot(gap_2007, aes(
x = gdpPercap,
y = lifeExp,
color = continent
)) +
geom_point(alpha = 0.6, size = 2) +
scale_x_log10() +
labs(
title = "Life Expectancy vs GDP per Capita (2007)",
x = "GDP per Capita (log scale)",
y = "Life Expectancy"
) +
nhsbsaR::theme_nhsbsa_gg() +
scale_color_manual(values = c(
"Asia" = "#003087", # NHS Dark Blue
"Europe" = "#009639", # NHS Green
"Africa" = "#7C2855", # NHS Dark Pink
"Americas" = "#FFB81C", # NHS Warm Yellow
"Oceania" = "#00A9CE" # NHS Aqua Blue
))
p
ggplotly(p)
The tooltips could do with some formatting control however. To do
this we can pass some instructions to ggplotly using the
text argument within the ggplot2 aesthetics:
p <- ggplot(gap_2007, aes(
x = gdpPercap,
y = lifeExp,
color = continent,
text = paste0(
"Country: ", country, "<br>",
"Continent: ", continent, "<br>",
"GDP per Capita: ", round(gdpPercap, 2), "<br>",
"Life Expectancy: ", round(lifeExp, 1)
)
)) +
geom_point(alpha = 0.6, size = 2) +
scale_x_log10() +
labs(
title = "Life Expectancy vs GDP per Capita (2007)",
x = "GDP per Capita (log scale)",
y = "Life Expectancy"
) +
nhsbsaR::theme_nhsbsa_gg() +
scale_color_manual(values = c(
"Asia" = "#003087", # NHS Dark Blue
"Europe" = "#009639", # NHS Green
"Africa" = "#7C2855", # NHS Dark Pink
"Americas" = "#FFB81C", # NHS Warm Yellow
"Oceania" = "#00A9CE" # NHS Aqua Blue
))
# when running the ggplot portion of this code we might get a warning that it is not using the text argument we just added. Not a problem though, ggplotly will use it.
ggplotly(p, tooltip = "text")
plotly can be used as a standalone graphics package.
However, ggplotly() conversion of ggplot2 plots
is a great option if you want a simple interactivity upgrade to a
ggplot graphic but don’t want to learn the syntax for both
packages! The only thing is that you tend to lose your NHS formatting
with the addition of ggplotly(), which is a bit annoying as
you need to specify some things again:
ggplotly(p, tooltip = "text") |>
layout(
legend = list( # Move the legend to be horizontal and centered at the top
orientation = "h",
x = 0.5,
y = 1.08,
xanchor = "center",
font = list(family = "Arial", size = 12)
),
margin = list(t = 100), # Add top margin for legend/title space
title = list(
font = list(family = "Arial", size = 20), # Center the title and change font
x = 0.5,
xanchor = "center"
)
)
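For completeness, here is a rough sketch of the same scatter plot built with plotly directly, without going through ggplot2. It assumes plotly and gapminder are installed; fig is an illustrative name, and no NHS styling is applied.

```r
library(plotly)
library(gapminder)

# Repeat the 2007 filter in base R so the snippet is self-contained
gap_2007 <- gapminder[gapminder$year == 2007, ]

fig <- plot_ly(
  data = gap_2007,
  x = ~gdpPercap,        # the ~ tells plotly to look the column up in the data
  y = ~lifeExp,
  color = ~continent,
  text = ~country,       # shown in the hover tooltip
  type = "scatter",
  mode = "markers"
) |>
  layout(
    title = "Life Expectancy vs GDP per Capita (2007)",
    xaxis = list(title = "GDP per Capita (log scale)", type = "log"),
    yaxis = list(title = "Life Expectancy")
  )

fig
```

The formula-style `~column` syntax is the main difference to get used to when coming from ggplot2.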
An alternative to plotly / ggplotly that
works very well with shiny applications is
highcharter. It is an R wrapper for the Highcharts
JavaScript library. The highcharter syntax is similar to
ggplot2 in that it follows a similar ‘Grammar of Graphics’
structure, but there are some differences.
highchart() |>
hc_add_series(data = gap_2007, # Similar to ggplot2, add data and aesthetics to your plot
type = "scatter", # Specify chart type
hcaes(x = gdpPercap, y = lifeExp, group = continent, name = country)) |>
hc_title(text = "Life Expectancy vs GDP per Capita (2007)") |> # Add main chart title
hc_xAxis(type = "logarithmic",
title = list(text = "GDP per Capita (log scale)")) |> # Use a log scale for the x-axis and give it a title
hc_yAxis(title = list(text = "Life Expectancy")) |> # Give the y-axis a title
hc_plotOptions(
scatter = list(
marker = list(
symbol = "circle" # Set all symbols to be circles
)
)
)|>
hc_tooltip( # Customise formatting for tooltips
useHTML = TRUE,
pointFormat = paste0(
"<b>Country:</b> {point.name}<br>",
"<b>GDP per Capita:</b> {point.x:.2f}<br>",
"<b>Life Expectancy:</b> {point.y:.2f}"
)
) |>
nhsbsaR::theme_nhsbsa_highchart(stack = FALSE)|>
hc_colors(c("#003087", "#009639", "#7C2855", "#FFB81C", "#00A9CE")) |> # specify NHS colours from palette
hc_chart(zoomType = "xy") # enable zooming
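highcharter also provides hchart(), a shortcut that takes the data, chart type and aesthetics in one call. A shorter sketch of the chart above, assuming highcharter and gapminder are installed (hc is an illustrative name, and the NHSBSA theme and tooltip formatting are left out to keep it minimal):

```r
library(highcharter)
library(gapminder)

# Repeat the 2007 filter in base R so the snippet is self-contained
gap_2007 <- gapminder[gapminder$year == 2007, ]

hc <- hchart(
  gap_2007,
  type = "scatter",
  hcaes(x = gdpPercap, y = lifeExp, group = continent)
) |>
  hc_title(text = "Life Expectancy vs GDP per Capita (2007)") |>
  hc_xAxis(
    type = "logarithmic",
    title = list(text = "GDP per Capita (log scale)")
  ) |>
  hc_yAxis(title = list(text = "Life Expectancy"))

hc
```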
| Method | Pros | Cons |
|---|---|---|
| Base R | Quick, great for having a quick look at the data | Customisation options are basic, not ideal if you want to quickly create a ‘polished’ looking plot |
| ggplot2 | Really customisable and extensible, nhsbsa theming available, very well documented online | Static by default but can convert using plotly |
| ggplotly / plotly | Easy to convert a ggplot2 visual to add interactivity, handles interactive 3D plots, very well adopted so well documented online | Can lose theme styling when converting from ggplot2. Limited layout control. Syntax less user friendly than ggplot2 |
| highcharter | Plots look polished by default, can easily apply nhsbsa theme, integrates well with shiny apps | Full customisation requires a bit of JavaScript knowledge. No 3D functionality. Limited in terms of statistical layers that you can add. |
Some very quick pointers for best practice when plotting data:
Try to:
Avoid:
Misleading axes (e.g. truncated y-axis)
Overplotting (use transparency controls or jitter)
Overcomplicating a single plot - sometimes two simple charts together can be better than one more complex one.
Ideally, the plot should be able to stand alone and be understood without any accompanying text, although when it comes to accessibility this isn’t always possible. Having a text description of the key insights can help here.
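On the overplotting point, a small sketch of combining jitter with transparency in ggplot2 is shown below. It assumes ggplot2 and gapminder are installed; p_jitter is an illustrative name and the width and alpha values are arbitrary starting points rather than recommendations.

```r
library(ggplot2)
library(gapminder)

p_jitter <- ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  # Spread points horizontally only, so the life expectancy values stay exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  labs(
    title = "Life Expectancy by Continent (all years)",
    x = "Continent",
    y = "Life Expectancy"
  )

p_jitter
```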
Graham’s past C&C: Data Visualisation with ggplot2
Adnan’s past C&C: Highcharter Introduction